Introduce async scheduler implementation with mixin pattern #941
GOavi101 wants to merge 1 commit into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing. We also recommend installing prek and configuring it to check your code before every local commit.
Force-pushed from 1a3ecbb to b0e8e83
```python
SchedulerOutput = None

logger = init_logger(__name__)
from vllm_spyre.v1.core.scheduler_impl import (
```
@GOavi101 it looks like most of this file has been deleted and moved to scheduler_impl. Can you put the implementation back in this file so that reviewers can see what's changed?
Thanks, I've looked through the tests, but I'll wait to review the code changes until this diff is in nicer shape; I don't really want to try to recreate the diff myself 😉
Force-pushed from b0e8e83 to d71cfb3
```python
    return EMPTY_MODEL_RUNNER_OUTPUT
cached = self._last_execute_model_output
self._last_execute_model_output = None
return cached if cached is not None else EMPTY_MODEL_RUNNER_OUTPUT
```
Ideally we would actually run the sampling here - see related comment on the structured output PR: #903 (comment)
I'm fine with leaving this as-is and then fixing it to work with both async scheduling and structured outputs in a followup. Issue opened here: #947
Agreed, thanks for opening the issue. I've added a TODO(#947) comment pointing to it so it's tracked directly in the code.
```
Key behaviours under test:
- _is_async_scheduler() correctly identifies async vs sync instances
- PoolingSpyreMixin.schedule() applies warmup-shape constraints in both modes
- ChunkedPrefillSpyreMixin.schedule() bypasses Spyre constraints in async mode
```
This statement seems incorrect- we definitely can't just bypass spyre constraints because there are hard limits to what we can run on the cards. What's really going on?
Correct, nothing is bypassed; that docstring was wrong. The same constraints apply in both modes. The only async-specific code is a stale ongoing_prefills cleanup, needed because _update_after_schedule speculatively advances num_computed_tokens before update_from_output() confirms it. I've fixed the docstring.
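To make the cleanup concrete, here is a minimal sketch of the behaviour described above. All classes are stand-ins for illustration only (the real vLLM `Request`, `AsyncScheduler`, and mixin types have much more state); the names `ongoing_prefills`, `num_computed_tokens`, and `num_prompt_tokens` mirror the discussion.

```python
class Request:
    """Stand-in for a scheduled request; not the real vLLM Request type."""

    def __init__(self, req_id: str, num_prompt_tokens: int):
        self.req_id = req_id
        self.num_prompt_tokens = num_prompt_tokens
        self.num_computed_tokens = 0


class AsyncSchedulerStub:
    """Mimics AsyncScheduler._update_after_schedule, which speculatively
    advances num_computed_tokens before update_from_output() confirms it."""

    def _update_after_schedule(self, request: Request, scheduled_tokens: int):
        request.num_computed_tokens += scheduled_tokens  # run-ahead advance


class SpyreMixinStub(AsyncSchedulerStub):
    def __init__(self):
        self.ongoing_prefills: list[Request] = []

    def _cleanup_stale_prefills(self):
        # Async-only cleanup: drop requests whose prefill the speculative
        # advance already finished, so the next schedule() call sees
        # consistent state.
        self.ongoing_prefills = [
            r for r in self.ongoing_prefills
            if r.num_computed_tokens < r.num_prompt_tokens
        ]


sched = SpyreMixinStub()
req = Request("r0", num_prompt_tokens=8)
sched.ongoing_prefills.append(req)
sched._update_after_schedule(req, scheduled_tokens=8)  # prefill done speculatively
sched._cleanup_stale_prefills()
print(len(sched.ongoing_prefills))  # 0: the stale entry is removed
```

In the synchronous scheduler this cleanup is a no-op, because num_computed_tokens is only advanced after the model output is confirmed.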
```python
    is_pooling=True,
)
# Set as string path for vLLM's resolution (matches upstream behavior)
# Only convert to string if it's not already a string
```
a class should be fine to pass here though, what goes wrong?
Nothing goes wrong; you're right. SchedulerConfig.scheduler_cls is typed str | type | None and get_scheduler_cls() handles a class directly, so the string conversion was unnecessary. I've removed it in the latest push.
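For reference, a hypothetical sketch of the str | type | None handling being discussed. `resolve_scheduler_cls` and `DefaultScheduler` are made-up names for illustration, not vLLM's actual `get_scheduler_cls`; the point is only that a class object passes through untouched, so no string conversion is needed.

```python
import importlib


class DefaultScheduler:
    """Placeholder default scheduler; not the real vLLM class."""


def resolve_scheduler_cls(scheduler_cls):
    """Hypothetical mirror of resolving a str | type | None config value:
    strings are resolved as dotted import paths, classes pass through."""
    if scheduler_cls is None:
        return DefaultScheduler
    if isinstance(scheduler_cls, str):
        module_name, _, cls_name = scheduler_cls.rpartition(".")
        return getattr(importlib.import_module(module_name), cls_name)
    return scheduler_cls  # already a class: no string conversion needed


print(resolve_scheduler_cls(DefaultScheduler) is DefaultScheduler)  # True
print(resolve_scheduler_cls("collections.OrderedDict").__name__)  # OrderedDict
```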
```python
# The mixin's pre-filter pattern is not safe under that run-ahead scenario.
# For TP=1 (UniProcExecutor), futures are immediately done so it's safe.
if parallel_config.world_size > 1:
    scheduler_config.async_scheduling = False
```
Interesting- if we wanted to support this feature then it would likely need to work with TP=4 which is how we run most models. I thought this was only incompatible with pipeline parallel upstream - does it also not work with tensor parallel?
The fix is SpyreMultiprocExecutor — a thin MultiprocExecutor subclass that overrides max_concurrent_batches to return 1 instead of 2. This forces the engine to use the simpler step() path (strictly schedule → execute → update) rather than step_with_batch_queue, which was the only thing that broke TP>1.
Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose. The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
what do you think?
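A minimal stub sketch of the override proposed in this comment. The class names here are stand-ins rather than the real vLLM executor classes, and `pick_step_path` is a made-up function approximating the engine's dispatch decision.

```python
class MultiprocExecutorStub:
    """Stand-in for MultiprocExecutor: max_concurrent_batches == 2 lets the
    engine use the step_with_batch_queue run-ahead path."""

    @property
    def max_concurrent_batches(self) -> int:
        return 2


class SpyreMultiprocExecutorStub(MultiprocExecutorStub):
    """The thin subclass described above: capping concurrency at 1 forces
    the plain schedule -> execute -> update step() path."""

    @property
    def max_concurrent_batches(self) -> int:
        return 1


def pick_step_path(executor) -> str:
    # Hypothetical mirror of how the engine chooses its step loop.
    return ("step_with_batch_queue"
            if executor.max_concurrent_batches > 1 else "step")


print(pick_step_path(SpyreMultiprocExecutorStub()))  # step
```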
That doesn't quite line up with my understanding: IIUC the step_with_batch_queue method is what works with the speculative scheduling. The engine runs the scheduler again while the model is running, assuming that the requests in the batch will continue.
> Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose
I don't quite understand this either- the multiproc executor is definitely async, it broadcasts an RPC to the workers to run the model and the engine gets back a future that it waits on. step_with_batch_queue queues up that future so that it can speculatively schedule the next pass.
This TP=1 profile shows the scheduler running in between the model forward passes, the goal with async scheduling is to get the scheduler running for the next step during the model forward pass instead:
> The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
> So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
Based on the above, my understanding is that the run-ahead state is the whole point and we won't gain any performance benefit from this unless we support it, so this is a blocker. Is there something else I'm missing?
You're right, thanks for the correction. I'll fix this — snapshot the mixin's mutable state (ongoing_prefills, tkv, previous_step_was_prefill) before delegating to super().schedule() so the run-ahead second schedule() call sees consistent state, and remove SpyreMultiprocExecutor. That way TP≥2 gets the full async scheduling benefit.
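A minimal sketch of the snapshot-and-restore idea, assuming the three mutable fields named above. The class and method names are stand-ins for illustration; the real mixin would take the snapshot before delegating to super().schedule() and restore it if the run-ahead schedule needs a consistent view.

```python
class ChunkedPrefillStateStub:
    """Stand-in carrying the mutable mixin state named in the comment:
    ongoing_prefills, tkv, previous_step_was_prefill."""

    def __init__(self):
        self.ongoing_prefills = ["req-0"]
        self.tkv = 64
        self.previous_step_was_prefill = True

    def snapshot(self):
        # Shallow-copy the list so a run-ahead second schedule() call can
        # mutate live state without corrupting the saved view.
        return (list(self.ongoing_prefills), self.tkv,
                self.previous_step_was_prefill)

    def restore(self, snap):
        prefills, tkv, was_prefill = snap
        self.ongoing_prefills = list(prefills)
        self.tkv = tkv
        self.previous_step_was_prefill = was_prefill


state = ChunkedPrefillStateStub()
snap = state.snapshot()
# Simulate the speculative second schedule() mutating live state...
state.ongoing_prefills.clear()
state.tkv = 128
state.restore(snap)
print(state.ongoing_prefills, state.tkv)  # ['req-0'] 64
```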
Thanks @GOavi101! A few notes:
Force-pushed from 1bd875b to 2246d48
After vLLM 0.14, the async scheduler is enabled by default. All the tests below are run using the async scheduler. To run with the synchronous scheduler:
Force-pushed from 7b0a718 to fb7ee62
Replace _create_pooling_scheduler() and _create_chunked_prefill_scheduler() factory functions with PoolingSpyreMixin and ChunkedPrefillSpyreMixin classes. Each mixin uses _is_async_scheduler() (an isinstance check) to detect the concrete base class at runtime and adjust behaviour accordingly, instead of capturing is_async via a closure variable.

Concrete classes use simple multiple inheritance:

```python
class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler): pass
class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler): pass
class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler): pass
class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler): pass
```

Side effects:
- __module__/__name__/__qualname__ fixup blocks removed (no longer needed)
- _async_warning_logged flag removed (debug log emitted each call is fine)
- TYPE_CHECKING import removed (unused after refactor)

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
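The runtime detection in the commit message can be sketched self-contained. The base classes below are empty stand-ins for vLLM's Scheduler and AsyncScheduler; only the mixin/isinstance shape matters here.

```python
class Scheduler:
    """Stand-in for vLLM's synchronous Scheduler base class."""


class AsyncScheduler(Scheduler):
    """Stand-in for vLLM's AsyncScheduler base class."""


class PoolingSpyreMixin:
    def _is_async_scheduler(self) -> bool:
        # Detect the concrete base class at runtime via isinstance, instead
        # of capturing an is_async flag in a factory-function closure.
        return isinstance(self, AsyncScheduler)


class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler):
    pass


class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler):
    pass


print(PoolingSpyreScheduler()._is_async_scheduler())       # False
print(AsyncPoolingSpyreScheduler()._is_async_scheduler())  # True
```

Because the mixin comes first in the MRO, it can override schedule() for both bases while _is_async_scheduler() tells it which concrete scheduler it is mixed into.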
Force-pushed from fb7ee62 to c5db31a
Description
Introduce async scheduler implementation with mixin pattern for cleaner architecture.
New Implementation (mixins)
- PoolingSpyreMixin and ChunkedPrefillSpyreMixin classes
- _is_async_scheduler() (isinstance check)
- class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler)
- class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler)
- class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler)
- class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler)

Related Issues
Test Plan
tests/v1/core/test_async_scheduler.py (16 tests):
- TestIsAsyncScheduler: verifies _is_async_scheduler() detection (4 tests)
- TestPoolingSpyreMixinSchedule: tests warmup-shape constraints in sync/async modes (4 tests)
- TestChunkedPrefillSpyreMixinSchedule: verifies constraint bypass in async mode (3 tests)
- TestChunkedPrefillSpyreMixinUpdateFromOutput: tests scheduler output filtering in async mode (5 tests)

Checklist
- Code formatted (bash format.sh)
- Commit includes a Signed-off-by: line (DCO compliance)